Introduction

Instead of predicting one label (cat, dog, etc.) per image, we will predict one label per pixel!

Each pixel should belong to a class (cat, dog, etc.) or to a background class.

Applications

  • Autonomous driving
  • Medicine

Representing the task

Similar to how we treat standard categorical values, we’ll create our target by one-hot encoding the class labels - essentially creating an output channel for each of the possible classes.
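
A minimal sketch of this encoding, assuming the label mask stores one integer class index per pixel (values and shapes are illustrative):

```python
import numpy as np

num_classes = 3                      # e.g. cat, dog, background
# Label mask: one class index per pixel, shape (H, W).
mask = np.array([[0, 0, 2],
                 [1, 1, 2]])

# One output channel per class: channel c is 1 where the pixel belongs to class c.
one_hot = np.eye(num_classes, dtype=np.float32)[mask]   # (H, W, num_classes)
one_hot = one_hot.transpose(2, 0, 1)                    # (num_classes, H, W)
print(one_hot.shape)                                    # (3, 2, 3)
```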

Models

Note that the model backbone can be a resnet, densenet, inception…
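
For example, the third-party segmentation_models_pytorch library (an assumption here, not part of these notes) exposes the backbone as a constructor argument, so swapping it is a one-line change:

```python
import segmentation_models_pytorch as smp

# Same decoder, different backbones; encoder names come from smp's model zoo.
unet_resnet = smp.Unet(encoder_name="resnet34", encoder_weights="imagenet", classes=3)
unet_densenet = smp.Unet(encoder_name="densenet121", encoder_weights="imagenet", classes=3)
```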

Naive model: Convolutions + Transpose Convolutions (stride=2)

Better model: Convs + TransposeConvs (stride=2) + skip connections (concatenation) = U-Net (see the sketch below)
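
A minimal sketch of this encoder-decoder pattern in PyTorch (a toy U-Net-style network, not the original U-Net; layer sizes are illustrative):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, keeping the spatial size (padding=1).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=3):
        super().__init__()
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        # Transpose convolutions with stride=2 upsample back to the input resolution.
        self.up2 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec2 = conv_block(128, 64)          # 128 = 64 (upsampled) + 64 (skip)
        self.up1 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = conv_block(64, 32)           # 64 = 32 (upsampled) + 32 (skip)
        self.head = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)                        # (N, 32, H, W)
        e2 = self.enc2(self.pool(e1))            # (N, 64, H/2, W/2)
        b = self.bottleneck(self.pool(e2))       # (N, 128, H/4, W/4)
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)                     # (N, num_classes, H, W)

logits = TinyUNet()(torch.randn(1, 3, 64, 64))   # torch.Size([1, 3, 64, 64])
```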

History of the state of the art

Name          Description                          Date  Instances
FCN           Fully Convolutional Network          2014
SegNet        Encoder-decoder                      2015
U-Net         Concatenates like a DenseNet         2015
DeepLab       Atrous convolution and CRF           2016
ENet          Real-time video segmentation         2016
PSPNet        Pyramid Scene Parsing Network        2016
FPN           Feature Pyramid Networks             2016  Yes
DeepLabv3     Increasing dilation & field-of-view  2017
LinkNet       Adds like a ResNet                   2017
DeepLabv3+                                         2018
PANet         Path Aggregation Network             2018  Yes
Panoptic FPN  Panoptic Feature Pyramid Networks    2019  ?
PointRend     Image Segmentation as Rendering      2019  ?

Post-processing (OPTIONAL)

  • Conditional Random Fields (CRF)
  • GrabCut (see the sketch below)
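
A minimal GrabCut sketch with OpenCV; the image path and the bounding rectangle are assumptions for illustration:

```python
import cv2
import numpy as np

img = cv2.imread("image.jpg")              # hypothetical input image path
mask = np.zeros(img.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)  # internal GMM state used by GrabCut
fgd_model = np.zeros((1, 65), np.float64)

rect = (50, 50, 200, 200)                  # assumed (x, y, w, h) box around the object
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels marked as definite or probable foreground become the refined mask.
refined = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype(np.uint8)
segmented = img * refined[:, :, None]
```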

Metrics and losses

  • Pixel-wise cross entropy
  • IoU (Jaccard index): (Pred ∩ GT) / (Pred ∪ GT) = TP / (TP + FP + FN)
  • Dice (F1): 2 · (Pred ∩ GT) / (Pred + GT) = 2·TP / (2·TP + FP + FN)
    • Ranges from 0 (worst) to 1 (best); a small worked example follows this list
    • To formulate a loss function that can be minimized, we simply use 1 − Dice
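
A small NumPy check of these formulas on toy binary masks (the mask values are illustrative):

```python
import numpy as np

# Binary masks for one class: 1 = pixel belongs to the class, 0 = background.
pred = np.array([[1, 1, 0],
                 [1, 0, 0],
                 [0, 0, 0]])
gt   = np.array([[1, 1, 0],
                 [0, 1, 0],
                 [0, 0, 0]])

tp = np.sum((pred == 1) & (gt == 1))   # 2
fp = np.sum((pred == 1) & (gt == 0))   # 1
fn = np.sum((pred == 0) & (gt == 1))   # 1

iou  = tp / (tp + fp + fn)             # 2 / 4 = 0.50
dice = 2 * tp / (2 * tp + fp + fn)     # 4 / 6 ≈ 0.67
print(iou, dice)
```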

Pixel-wise cross entropy
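
A minimal PyTorch sketch, assuming logits of shape (N, C, H, W) and integer class-index targets of shape (N, H, W):

```python
import torch
import torch.nn as nn

num_classes = 3                                      # e.g. cat, dog, background
logits = torch.randn(2, num_classes, 64, 64)         # raw model output (N, C, H, W)
target = torch.randint(0, num_classes, (2, 64, 64))  # class index per pixel (N, H, W)

# nn.CrossEntropyLoss applies log-softmax over the channel dimension and
# averages the per-pixel negative log-likelihood.
criterion = nn.CrossEntropyLoss()
loss = criterion(logits, target)
print(loss.item())
```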

Dice loss
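
A minimal soft Dice loss sketch in PyTorch (1 − Dice on softmax probabilities; the exact formulation here is one common choice, not prescribed by these notes):

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss: 1 - Dice, averaged over classes and batch.

    logits: (N, C, H, W) raw scores; target: (N, H, W) integer class indices.
    """
    num_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)               # (N, C, H, W)
    one_hot = F.one_hot(target, num_classes)       # (N, H, W, C)
    one_hot = one_hot.permute(0, 3, 1, 2).float()  # (N, C, H, W)

    intersection = (probs * one_hot).sum(dim=(2, 3))            # per image, per class
    cardinality = probs.sum(dim=(2, 3)) + one_hot.sum(dim=(2, 3))
    dice = (2 * intersection + eps) / (cardinality + eps)
    return 1 - dice.mean()

loss = dice_loss(torch.randn(2, 3, 64, 64), torch.randint(0, 3, (2, 64, 64)))
```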

Notebook: CamVid dataset

Reference